Static Compression for Dynamic Texts
Abstract
Two problems arise when semi-static word-based compression methods are applied to large texts, such as those stored in information retrieval systems. First, the space required for the model during decoding can become very large. Second, the need to handle document insertions means that the collection must be periodically recompressed if compression efficiency is to be maintained. Here we show that with careful management the impact of both of these drawbacks can be minimised. Experiments with a word-based model and over 500 Mb of text show that compression rates can be retained even in the face of severe memory limitations on the decoder, and in the face of significant expansion in the size of the text itself.
The use of a word-based zero-order compression model to represent English text has been considered by several authors [2, 6, 7, 15]. It is particularly appropriate for compressing full-text document collections, an application in which very large quantities of text are stored, but individual documents must be independently decodable [1]. In this case a semi-static approach that makes a preliminary pass over the text before compressing it in a second pass is appropriate, and when the word-based model is coupled with static Huffman coding, it yields the three necessary properties: good compression, fast decoding, and random access [10].
There are, however, two drawbacks to the use of this combination. First, a great deal of decode-time memory space might be required to store the tokens of the compression alphabet. Experience has shown that the number of distinct "words" in a text grows as an almost linear function of its size, without the tailing-off effect often predicted. If for no other reason, this happens because random spelling mistakes occur at a reasonably constant rate, and these are all regarded as novel words by the compression system.
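The two-pass scheme described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`tokenize`, `huffman_code_lengths`, `compressed_bits`) are hypothetical, and the rule that a word is a maximal run of ASCII letters and digits is an assumption, since this excerpt does not define the tokenisation. The first pass counts words and non-words separately; the second pass is represented only by the total number of bits a static Huffman code would spend on the text, ignoring the cost of storing the model itself.

```python
import heapq
import re
from collections import Counter
from itertools import count


def tokenize(text):
    """Split text into words and non-words (assumed rule: a word is a
    maximal run of ASCII letters/digits; everything between is a non-word)."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    nonwords = re.findall(r"[^A-Za-z0-9]+", text)
    return words, nonwords


def huffman_code_lengths(freqs):
    """Return {symbol: codeword length in bits} for a static Huffman code."""
    if len(freqs) == 1:
        # A one-symbol alphabet still needs one bit per occurrence.
        return {sym: 1 for sym in freqs}
    tiebreak = count()  # unique counter so the symbol lists are never compared
    heap = [(f, next(tiebreak), [sym]) for sym, f in freqs.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)
        f2, _, syms2 = heapq.heappop(heap)
        # Every symbol under the merged node moves one level deeper.
        for sym in syms1 + syms2:
            lengths[sym] += 1
        heapq.heappush(heap, (f1 + f2, next(tiebreak), syms1 + syms2))
    return lengths


def compressed_bits(text):
    """Pass 1: gather statistics. Pass 2: sum the codeword lengths.
    Words and non-words are coded with separate zero-order models."""
    total = 0
    for tokens in tokenize(text):
        freqs = Counter(tokens)
        lengths = huffman_code_lengths(freqs)
        total += sum(freqs[sym] * lengths[sym] for sym in freqs)
    return total


if __name__ == "__main__":
    sample = "the cat and the dog and the bird"
    print(compressed_bits(sample), "bits for", len(sample) * 8, "bits of input")
```

Because the code is built once from whole-collection statistics, any document can be decoded on its own with the same table, which is the random-access property mentioned above. The price is that every distinct token must be held in the decoder's model.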
As part of the international TREC information retrieval experiment we have been dealing with several corpora of English text, each of approximately 500 Mb [5]. One of the collections is several years of articles from the Wall Street Journal, and this 508.15 Mb WSJ database uses 289,101 distinct words totalling 2,159,044 characters; and 8,912 distinct non-words requiring in total 77,882 bytes. Allowing a 4-byte string pointer for each word, and ignoring for the moment
Publication date: 1994